Data Curation at Scale: The Data Tamer System
نویسندگان
چکیده
Data curation is the act of discovering a data source(s) of interest, cleaning and transforming the new data, semantically integrating it with other local data sources, and deduplicating the resulting composite. There has been much research on the various components of curation (especially data integration and deduplication). However, there has been little work on collecting all of the curation components into an integrated end-to-end system. In addition, most of the previous work will not scale to the sizes of problems that we are finding in the field. For example, one web aggregator requires the curation of 80,000 URLs and a second biotech company has the problem of curating 8000 spreadsheets. At this scale, data curation cannot be a manual (human) effort, but must entail machine learning approaches with a human assist only when necessary. This paper describes Data Tamer, an end-to-end curation system we have built at M.I.T. Brandeis, and Qatar Computing Research Institute (QCRI). It expects as input a sequence of data sources to add to a composite being constructed over time. A new source is subjected to machine learning algorithms to perform attribute identification, grouping of attributes into tables, transformation of incoming data and deduplication. When necessary, a human can be asked for guidance. Also, Data Tamer includes a data visualization component so a human can examine a data source at will and specify manual transformations. We have run Data Tamer on three real world enterprise curation problems, and it has been shown to lower curation cost by about 90%, relative to the currently deployed production software. This article is published under a Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits distribution and reproduction in any medium as well allowing derivative works, provided that you attribute the original work to the author(s) and CIDR 2013. 6th Biennial Conference on Innovative Data Systems Research (CIDR ’13) January 6-9, 2013, Asilomar, California, USA.
منابع مشابه
Study of the foundation, models and issues of research data curation and management in scientific and academic environments
Background and Aim: The purpose of this paper is to study, identifying and discuss the foundation and concepts, models and frameworks, dimensions and challenges of research data curation and management in scientific and academic environments. Method: This article is a review article and library method was used to collect scientific and research texts in this field. In this research, external an...
متن کاملPower of Human Curation in Recommendation System
This paper introduces human curation signals and demonstrates incorporating human curation signals improves the relevance of state-of-art recommendation system models by up to 30% by experiments on a large-scale Pinterest dataset.
متن کاملCyberinfrastructure Supporting Evolving Data Collections
The requirements to support large-scale and complex research collections are growing at an accelerated pace. Considering the continuous evolution of the collections, their increasing sizes, the technologies supporting them, and the importance of adequate data management to long-term preservation, a team at the Texas Advanced Computing Center (TACC) developed a cyberinfrastructure to aid researc...
متن کاملOnline Scientific Data Curation, Publication, and Archiving
Science projects are data publishers. The scale and complexity of current and future science data changes the nature of the publication process. Publication is becoming a major project component. At a minimum, a project must preserve the ephemeral data it gathers. Derived data can be reconstructed from metadata, but metadata is ephemeral. Longer term, a project should expect some archive to pre...
متن کاملGEOMetaCuration: a web-based application for accurate manual curation of Gene Expression Omnibus metadata
Metadata curation has become increasingly important for biological discovery and biomedical research because a large amount of heterogeneous biological data is currently freely available. To facilitate efficient metadata curation, we developed an easy-touse web-based curation application, GEOMetaCuration, for curating the metadata of Gene Expression Omnibus datasets. It can eliminate mechanical...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2013